Introduction

This kaggle competition asks the user to predict housing prices. The core dataset is show below. It carries 80 possible explanatory features for housing prices, split roughly evenly between numerical and categorical variables.

train <- read_csv('data/train.csv')
test <- read_csv('data/test.csv')
paged_table(train)

The first thing we notice is that there is a substantial amount of variation is our target variable, with a significant amount of leftward skew. We’ll tackle this skew later on.

library(plotly)

density <- density(train$SalePrice)

fig <- plot_ly(x = ~density$x, y = ~density$y, type = 'scatter', mode = 'lines', fill = 'tozeroy')
fig <- fig %>% layout(xaxis = list(title = 'SalePrice'),
         yaxis = list(title = 'Density'))

fig

Dataset preparation

There are three main issues that need to be addressed before we can train our ML models:

Let’s tackle missingess first.

Missingness

In the table below, we see missingness in ~1/4 (19/79) explanatory variables. ~90% of this missingness comes from 5 variables:

  • FireplaceQu
  • Fence
  • Alley
  • MiscFeature
  • PoolQC
library(naniar)
miss_var_summary(train) %>% 
  mutate(cum_pct = cumsum(n_miss)/sum(n_miss)) %>%
  filter(n_miss>0) %>% 
  paged_table(.)

When we inspect the data description, we quickly see that almost all of this missingness is not true missingness, but rather tied to the way the data was encoded. For example:

  • PoolQC: NA means “No pool”
  • FireplaceQu: NA means “No Fireplace”
  • Alley: NA means “No alley access”
  • Fence: NA means “No fence”
  • MiscFeature: NA means “None”
  • GarageType, GarageFinish, GarageQual, GarageCond: NA means “No garage”
  • BasmtQual, BsmtCond, BsmtFinType1, BsmtFinType2: NA means “No basement”
correct_NA = function(data) {
  col_list <- list()
  col_list[["PoolQC"]] <- "No pool"
  col_list[["FireplaceQu"]] <- "No fireplace"
  col_list[["Alley"]] <- "No alley"
  col_list[["Fence"]] <- "Fence"
  col_list[["MiscFeature"]] <- "None"
  for (col in c("GarageType","GarageFinish","GarageQual","GarageCond")) {
    col_list[[col]] <- "No garage"
  }
  for (col in c("BsmtQual","BsmtCond", "BsmtFinType1", "BsmtFinType2", "BsmtExposure")) {
    col_list[[col]] <- "No basement"
  }
  col_list[["LotFrontage"]] <- 0
  col_list[["GarageYrBlt"]] <- 0
  
  data %>% 
    tidyr::replace_na(col_list) %>%
    return(.)
}
train2 <- correct_NA(train) 

After making these adjustments, we see that the true missingness is actually quite limited (~0.1% of the sample).

train2 %>%
  miss_var_summary(.) %>%
  filter(n_miss >0) %>% 
  paged_table(.)

In theory, our tree-based algorithms may handle this directly, with minimal loss of generality. Unfortunately, when we run the same adjustment scheme on the test set, we find there is significantly more missingness (0.04%), including on several variables (e.g. MSZoning) that show no missingness in the train set.

test2 <- correct_NA(test) 
test2 %>% 
  miss_var_summary(.) %>%
  filter(n_miss >0) %>% 
  paged_table(.)

To address this, we will use the missForest package, which implements random forest imputation in R. Its main advantage over its Python equivalent is that it handles categorical variables directly without the need to use oneHotEncoder or other dummification schemes. To use this package, we first need to convert our categorical variables into factors. We also need to ensure that our numerical variables are not just numerically encoded factors. After inspecting the data_description.txt, we perform the conversion below.

factor_conversion <- function(data) {
  data %>% 
    mutate_at(vars(MSSubClass), ~paste0(as.character(.),"v")) %>% 
    mutate(across(where(is.character), as.factor))  %>% 
    as.data.frame(.) %>% 
    return(.)
}
train3 <- factor_conversion(train2)
test3 <- factor_conversion(test2)

Then we perform the imputation.

library(missForest)
library(doParallel)
registerDoParallel(cores=6)
miss_train <- missForest(train3, parallelize = 'forests') 
train4 <- miss_train$ximp
miss_test <- missForest(test3, parallelize = 'forests')
test4 <- miss_test$ximp

We note that while the OOB error is very small in both cases, it is much larger in the test set than the train set:

bind_rows(miss_train$OOBerror,miss_test$OOBerror) %>% 
  bind_cols(data.frame(Dataset = c("Train","Test")),.) %>% 
  paged_table()

Multicollinearity

Given there are 80 explanatory variables in the dataset, it is unsurprising that multicollinearity is a significant concern. To size the issue, we must first define correlation in the context of a mix of categorical and numerical variables. We do so by leveraging the GoodmanKruskal package, which implements Goodman and Kruskal’s tau measure. The tau measure is a association statistic that allows for (1) all variable types and (2) asymmetric relationships. It represents the proportion of variation in one variable that can be explained by another. Below is the cumulative distribution of tau measure across variable pairs:

library(mltools)
library(GoodmanKruskal)
gktau <- train4 %>% 
  GKtauDataframe(.)
gkvec <- gktau %>% as.vector(.)
gkcdf <- gkvec[gkvec <= 1] %>% 
  empirical_cdf(.,ubounds=seq(0, 1, by=0.001)) 
plot_ly(x = ~gkcdf$UpperBound, y = ~gkcdf$CDF, type = 'scatter', mode = 'lines', fill = 'tozeroy') %>% 
   layout(xaxis = list(title = 'Goodman-Kruskal tau'),
         yaxis = list(title = 'Percent of variable relationships'))

As it turns out, multicollinearity is not as big of an issue here as would be feared. Only 33% of variable pairs explain more than 5% of the variation in one another. And only 17% of variable pairs explain more than 20% of the variance in one another.

To confirm this finding with a more intuitive association measure, let’s consider the distribution of Pearson correlation coefficients among the numeric variables. Below is the analogous cumulative distribution plot:

pears_vec <- lsr::correlate(train4) %>% 
  .$correlation %>% 
  as.vector(.) %>% 
  na.omit(.) %>% 
  abs(.)
pears_cdf <- pears_vec[pears_vec <= 1] %>% 
  empirical_cdf(.,ubounds=seq(0, 1, by=0.001))

plot_ly(x = ~pears_cdf$UpperBound, y = ~pears_cdf$CDF, type = 'scatter', mode = 'lines', fill = 'tozeroy') %>% 
   layout(xaxis = list(title = 'Absolute value of Pearson correlation coefficient'),
         yaxis = list(title = 'Percent of variable relationships'))

Only 25% of numerical variable pairs have a Pearson correlation coefficient > 0.2.

This bodes well for the accuracy of our predictions, but likely means we will need to consider many variables in our model. In other words, features selection is likely to be particularly challenging. Rather than perform this in a separate step, this paper implements the Boruta algorithm to help automate feature selection within the larger model training process. That said, which features drive housing prices is a core insight of this paper, one that we’ll return to after we fit our models.

Target normality

While tree-based models can handle asymmetric target variables, transforming these variable can make models more accurate, especially in this case. The reason: our model will be evaluated on RMSE, so the long tail we are worried about (high-priced homes) are likely to have an outsize impact on the overall error figure. In this paper, we’ll apply a log transformation using the TransformedTargetRegressor found in Scikit-learn. Upon inspection below, we can see that this produces a reasonably well-distributed variable without invoking more complex transformations like the quantile.

density <- density(log(train4$SalePrice))

fig <- plot_ly(x = ~density$x, y = ~density$y, type = 'scatter', mode = 'lines', fill = 'tozeroy')
fig <- fig %>% layout(xaxis = list(title = 'Log SalePrice'),
         yaxis = list(title = 'Density'))

fig

Model training

We are finally ready to train our model. We’ll be applying three tree-based models:

Dataset preparation

We must build our explanatory and target matrices in Python.

import numpy as np
from sklearn.preprocessing import quantile_transform
X = r.train4.iloc[:,1:-1] #dropping ID variable
y = r.train4.iloc[:,-1:]
y_transformed = np.log1p(y)

Feature selection

Because hyperparameter tuning and other model steps are very computationally involved, this paper chooses to perform the feature selection step first. This is an imperfect choice, but should allow for much more reasonable calculation times on a personal computer. To do so, we’ll leverage a relatively unknown package called BoostaRoota that allows for a quick implementation of XGBoost and other boosting algorithms into the Boruta framework. In particular, it automates categorical variable encoding, which significantly reduces user complexity.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer, quantile_transform
import xgboost as xgb
from boostaroota import BoostARoota

br = BoostARoota(metric = "rmse")
X_xgb = pd.get_dummies(X)
br.fit(X_xgb,y_transformed)

This variable selection returns ~16% of the total dummified variables. Interestingly though, a little over half of the underlying variables are included. In effect, most of the dimensionality reduction comes from reducing dummified options of categorical variables.

br.transform(X_xgb)
len(br.keep_vars_)
new_vars = np.unique([i.split("_")[0] for i in br.keep_vars_])
X_cat = X[new_vars]
len(new_vars)

Below are the variables we are keeping at least partially:

list(new_vars)
## ['1stFlrSF', 'BedroomAbvGr', 'BsmtCond', 'BsmtFinSF1', 'BsmtFullBath', 'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Electrical', 'EnclosedPorch', 'ExterCond', 'Exterior1st', 'Exterior2nd', 'Fence', 'Fireplaces', 'Functional', 'GarageArea', 'GarageCars', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea', 'HalfBath', 'Heating', 'KitchenAbvGr', 'LandContour', 'LotArea', 'LotFrontage', 'MSSubClass', 'MSZoning', 'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PavedDrive', 'SaleCondition', 'SaleType', 'Street', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd']

And here are the variables we are eliminating:

lst = list(set(X.columns.values).difference(set(new_vars)))
lst.sort()
lst
## ['2ndFlrSF', '3SsnPorch', 'Alley', 'BldgType', 'BsmtExposure', 'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtHalfBath', 'Condition2', 'ExterQual', 'FireplaceQu', 'Foundation', 'FullBath', 'GarageCond', 'GarageFinish', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandSlope', 'LotConfig', 'LotShape', 'LowQualFinSF', 'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'PoolArea', 'PoolQC', 'RoofMatl', 'RoofStyle', 'ScreenPorch', 'TotRmsAbvGrd', 'Utilities', 'WoodDeckSF', 'YrSold']

The major trends here are:

  • For characteristics represented multiple variables (e.g. basements), the model tends to keep only the core subset of the variables (e.g. BsmtCond, BsmtQual) while throwing out more ancillary variables (e.g. BsmtHalfBath)

  • Foundational aspects of the home (e.g. utilities, foundation, roofing) tend to be thrown out, whereas more visual features (e.g. Lot area, fencing, fireplaces, bedroom counts, age) tend to be included.

  • There are some notable exceptions to the rule (e.g. KitchenQual and PoolQual) that might make some homeowners think twice about renovations

The list includes some obvious entries (e.g. Street), but some very insightful ones (e.g. Pool metrics, roofing, heating, and electrical). We will return to the feature importance discussion after running our models.

Hyperparameter tuning

Now, we will tune our models, paying particular attention to the risk of over-fitting. Because of the complexity and continuity of the search space for tree-based model parameters, we’ll use Optuna to help us reach optimal parameters faster.

XGBoost

For XGBoost, we use an approach inspired by this kaggle post with parameter choices ranges adapted from this article. Rather than using cross_val_score, we integrate the train_test_split directly into the objective function. The traditional cross_val_score approach tends to produce errors

from xgboost import XGBRegressor
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split

def objective(trial):
  train_X, test_X, train_y, test_y = train_test_split(X_xgb, y, test_size=0.2,random_state=42)
  params = {
  "verbosity": 0,  # 0 (silent) - 3 (debug)
  # "objective": "reg:squarederror",
  "n_estimators": 10000,
  "max_depth": trial.suggest_int("max_depth", 4, 20),
  "learning_rate": trial.suggest_loguniform("learning_rate", 0.005, 0.05),
  "colsample_bytree": trial.suggest_loguniform("colsample_bytree", 0.2, 0.6),
  "subsample": trial.suggest_loguniform("subsample", 0.4, 0.8),
  "alpha": trial.suggest_loguniform("alpha", 0.01, 10.0),
  "lambda": trial.suggest_loguniform("lambda", 1e-8, 10.0),
  "gamma": trial.suggest_loguniform("lambda", 1e-8, 10.0),
  "min_child_weight": trial.suggest_loguniform("min_child_weight", 10, 1000),
  "random_state": 42
  }
  xgb_model = xgb.XGBRegressor(**params)
  transformed_xgb_model = TransformedTargetRegressor(regressor = xgb_model, func = np.log1p, inverse_func = np.expm1)
  transformed_xgb_model.fit(train_X,train_y, eval_set=[(test_X,test_y)],early_stopping_rounds=100,verbose=False)
  
  preds = transformed_xgb_model.predict(test_X)
  rmse = mean_squared_error(test_y, preds,squared=False)
    
  return rmse

study1 = optuna.create_study(direction='minimize')
study1.optimize(objective, n_trials=10)
xgb_params = study1.best_params

Here are our parameters for the xgboost regression, which we will use again in the next section.

xgb_params
## {'max_depth': 12, 'learning_rate': 0.01987725262501436, 'colsample_bytree': 0.4878537637478309, 'subsample': 0.6280749768145197, 'alpha': 0.21279988234217728, 'lambda': 0.0012346055168196114, 'min_child_weight': 17.237166267774228}

Light GBM regression

For LightGBM, we leverage the same approach above, using hyperparameter ranges adapted from this kaggle post.

import lightgbm as lgb
def objective(trial):
  train_X, test_X, train_y, test_y = train_test_split(X_cat, y, test_size=0.2,random_state=42)
  params = {
    'learning_rate': trial.suggest_loguniform('learning_rate',0.01, 1.0),
    'num_leaves': trial.suggest_int('num_leaves',24, 80),
    'feature_fraction': trial.suggest_uniform('feature_fraction',0.1, 0.9),
    'bagging_fraction': trial.suggest_uniform('bagging_fraction',0.8, 1),
    'max_depth': trial.suggest_int('max_depth',5, 30),
    'max_bin':trial.suggest_int('max_bin',20,90),
    'min_data_in_leaf': trial.suggest_int('min_data_in_leaf',20, 80),
    'min_sum_hessian_in_leaf':trial.suggest_uniform('min_sum_hessian_in_leaf',0,100),
    'subsample': trial.suggest_uniform('subsample',0.01, 1.0),
    'random_state': 42,
    'metric': 'rmse',
    'application':'regression'
    }
  lgb_model = lgb.LGBMRegressor(**params)
  transformed_lgb_model = TransformedTargetRegressor(regressor = lgb_model, func = np.log1p, inverse_func = np.expm1)
  transformed_lgb_model.fit(train_X,train_y, eval_set=[(test_X,test_y)],early_stopping_rounds=100,verbose=False)
  preds = transformed_lgb_model.predict(test_X)
  rmse = mean_squared_error(test_y, preds,squared=False)
    
  return rmse
study2 = optuna.create_study(direction='minimize')
study2.optimize(objective, n_trials=500)
lgb_params = study2.best_params

Here are our parameters for the LightGBM regression, which we will use again in the next section.

lgb_params
## {'learning_rate': 0.18475547283355995, 'num_leaves': 32, 'feature_fraction': 0.38498604907833156, 'bagging_fraction': 0.9673419291655969, 'max_depth': 15, 'max_bin': 81, 'min_data_in_leaf': 23, 'min_sum_hessian_in_leaf': 17.671558526095062, 'subsample': 0.6726629225849623}

CatBoost

CatBoost has hyperparameter tuning built-in, so no further effort is required here.

Final training

With our hyperparameters tuned, we may evaluate our models using cross_val_score().

XGBoost

import xgboost as xgb
xgb_model = xgb.XGBRegressor(**xgb_params)
transformed_xgb_model = TransformedTargetRegressor(regressor = xgb_model, func = np.log1p, inverse_func = np.expm1)
xgb_mean = -np.mean(cross_val_score(transformed_xgb_model,X_xgb,y, cv = 5, scoring = 'neg_root_mean_squared_error'))
xgb_mean
## 160172.86852335744

LightGBM

lgb_model = lgb.LGBMRegressor(**lgb_params)
transformed_lgb_model = TransformedTargetRegressor(regressor = lgb_model, func = np.log1p, inverse_func = np.expm1)
lgb_rmse = -np.mean(cross_val_score(transformed_lgb_model,X_cat,y, cv = 5, scoring = 'neg_root_mean_squared_error'))
lgb_rmse
## 26613.623011477048

CatBoost

from catboost import CatBoostRegressor
cat_model = CatBoostRegressor(verbose = 0, n_estimators = 100, cat_features=list(X_cat.dtypes[X_cat.dtypes=="category"].index))
transformed_cat_model = TransformedTargetRegressor(regressor = cat_model, func = np.log1p, inverse_func = np.expm1)
cat_rmse = -np.mean(cross_val_score(transformed_cat_model,X_cat,y, cv = 5, scoring = 'neg_root_mean_squared_error'))
cat_rmse
## 26141.48757894689

LightGBM and CatBoost appear to perform much better than XGBoost, so we will use these in our Kaggle submission. Some of XGBoost’s poor performance may be due to:

  1. Its slow hyperparameter tuning speed, which allowed for fewer training iterations in a reasonable amount of time on a personal computer.
  2. Its poorer handling of categorical features, which are very prominent in this dataset.

Results

Kaggle results

We are now ready to test our models against the kaggle submission set

Data preparation and final model training

testX = r.test4.iloc[:,1:] #dropping ID variable
testX_cat = testX[new_vars]
transformed_lgb_model.fit(X_cat,y)
transformed_cat_model.fit(X_cat,y)

Create submission files

lgb_results = transformed_lgb_model.predict(testX_cat)
cat_results = transformed_cat_model.predict(testX_cat)
ids = range(1461,1461+1459)
pd.DataFrame({'ID': ids,'SalePrice': [i[0] for i in lgb_results]}).to_csv("kaggle_submit/lightgbm.csv",index = False)
pd.DataFrame({'ID': ids,'SalePrice': [i[0] for i in cat_results]}).to_csv("kaggle_submit/catboost.csv",index = False)

Submission results

Catboost performed slightly better, with an RMSE of 0.13119, vs. lightGBM’s result of 0.13620. Either way, both placed in the top ~40% of submissions. By no means ideal, but a reasonably exciting result!

Feature importance

LightGBM

Below are the top features from the light GBM model

import matplotlib.pyplot as plt
importances = transformed_lgb_model.regressor_.feature_importances_
importances = [val/sum(importances) for val in importances]
results = pd.DataFrame({'Features': X_cat.columns,
                        'Importances': importances})
results.sort_values(by='Importances', inplace=True, ascending = False)
py$results %>% paged_table()

Catboost

import matplotlib.pyplot as plt
importances = transformed_cat_model.regressor_.feature_importances_
importances = [val/sum(importances) for val in importances]
results2 = pd.DataFrame({'Features': X_cat.columns,
                        'Importances': importances})
results2.sort_values(by='Importances', inplace=True, ascending = False)
py$results2 %>% paged_table()

Discussion

There are some common themes and notable differences in the sets of features listed above. Both list square footage above ground (GrLivArea) as one of the most important predictors of housing prices. Also both discount some commonly assumed factors as minimally important, such as:

  • Year Built
  • Type of sale
  • Kitchen metrics

But there are also some very notable differences. Beyond the square footage, CatBoost concentrates its importance in the Overall Quality rating of the home. LightGBM, takes a more nuanced approach, leveraging multiple factors that might imply quality but are most directly tied to home square footage. In fact, factors 2 - 8 for LightGBM are all tied to the size of the home. This is unfortunate news for would-be renovators, since improving room quality is much easier than increasing square footage. These renovators would hope that CatBoost’s interpretation is correct. That said, the fact that both models produce similar accuracy in the test set may suggest that both capture some important aspect of the market.